[OpenMP][OMPIRBuilder] Use device shared memory for arg structures #150925

skatrak · 2025-07-28T11:12:52Z

Argument structures are created when sections of the LLVM IR corresponding to an OpenMP construct are outlined into their own function. For this, stack allocations are used.

This patch modifies this behavior when compiling for a target device and outlining parallel-related IR, so that it uses device shared memory instead of private stack space. This is needed in order for threads to have access to these arguments.

llvmbot · 2025-07-28T11:13:25Z

@llvm/pr-subscribers-mlir-openmp
@llvm/pr-subscribers-mlir-llvm
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-mlir

Author: Sergio Afonso (skatrak)

Changes

Argument structures are created when sections of the LLVM IR corresponding to an OpenMP construct are outlined into their own function. For this, stack allocations are used.

This patch modifies this behavior when compiling for a target device and outlining parallel-related IR, so that it uses device shared memory instead of private stack space. This is needed in order for threads to have access to these arguments.

Full diff: https://github.com/llvm/llvm-project/pull/150925.diff

5 Files Affected:

(modified) llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h (+6)
(modified) llvm/include/llvm/Transforms/Utils/CodeExtractor.h (+34-5)
(modified) llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp (+89-9)
(modified) llvm/lib/Transforms/Utils/CodeExtractor.cpp (+56-17)
(modified) mlir/test/Target/LLVMIR/omptarget-parallel-llvm.mlir (+5-5)

diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
index 110b0fde863c5..967fe38c0d635 100644
--- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
+++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
@@ -2159,7 +2159,13 @@ class OpenMPIRBuilder {
   /// during finalization.
   struct OutlineInfo {
     using PostOutlineCBTy = std::function<void(Function &)>;
+    using CustomArgAllocatorCBTy = std::function<Instruction *(
+        BasicBlock *, BasicBlock::iterator, Type *, const Twine &)>;
+    using CustomArgDeallocatorCBTy = std::function<Instruction *(
+        BasicBlock *, BasicBlock::iterator, Value *, Type *)>;
     PostOutlineCBTy PostOutlineCB;
+    CustomArgAllocatorCBTy CustomArgAllocatorCB;
+    CustomArgDeallocatorCBTy CustomArgDeallocatorCB;
     BasicBlock *EntryBB, *ExitBB, *OuterAllocaBB;
     SmallVector<Value *, 2> ExcludeArgsFromAggregate;
 
diff --git a/llvm/include/llvm/Transforms/Utils/CodeExtractor.h b/llvm/include/llvm/Transforms/Utils/CodeExtractor.h
index 407eb50d2c7a3..cc472a5bf3576 100644
--- a/llvm/include/llvm/Transforms/Utils/CodeExtractor.h
+++ b/llvm/include/llvm/Transforms/Utils/CodeExtractor.h
@@ -17,6 +17,7 @@
 #include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/DenseMap.h"
 #include "llvm/ADT/SetVector.h"
+#include "llvm/IR/BasicBlock.h"
 #include "llvm/Support/Compiler.h"
 #include <limits>
 
@@ -24,7 +25,6 @@ namespace llvm {
 
 template <typename PtrType> class SmallPtrSetImpl;
 class AllocaInst;
-class BasicBlock;
 class BlockFrequency;
 class BlockFrequencyInfo;
 class BranchProbabilityInfo;
@@ -85,6 +85,10 @@ class CodeExtractorAnalysisCache {
   /// 3) Add allocas for any scalar outputs, adding all of the outputs' allocas
   ///    as arguments, and inserting stores to the arguments for any scalars.
   class CodeExtractor {
+    using CustomArgAllocatorCBTy = std::function<Instruction *(
+        BasicBlock *, BasicBlock::iterator, Type *, const Twine &)>;
+    using CustomArgDeallocatorCBTy = std::function<Instruction *(
+        BasicBlock *, BasicBlock::iterator, Value *, Type *)>;
     using ValueSet = SetVector<Value *>;
 
     // Various bits of state computed on construction.
@@ -133,6 +137,25 @@ class CodeExtractorAnalysisCache {
     // space.
     bool ArgsInZeroAddressSpace;
 
+    // If set, this callback will be used to allocate the arguments in the
+    // caller before passing it to the outlined function holding the extracted
+    // piece of code.
+    CustomArgAllocatorCBTy *CustomArgAllocatorCB;
+
+    // A block outside of the extraction set where previously introduced
+    // intermediate allocations can be deallocated. This is only used when an
+    // custom deallocator is specified.
+    BasicBlock *DeallocationBlock;
+
+    // If set, this callback will be used to deallocate the arguments in the
+    // caller after running the outlined function holding the extracted piece of
+    // code. It will not be called if a custom allocator isn't also present.
+    //
+    // By default, this will be done at the end of the basic block containing
+    // the call to the outlined function, except if a deallocation block is
+    // specified. In that case, that will take precedence.
+    CustomArgDeallocatorCBTy *CustomArgDeallocatorCB;
+
   public:
     /// Create a code extractor for a sequence of blocks.
     ///
@@ -149,7 +172,9 @@ class CodeExtractorAnalysisCache {
     /// the function from which the code is being extracted.
     /// If ArgsInZeroAddressSpace param is set to true, then the aggregate
     /// param pointer of the outlined function is declared in zero address
-    /// space.
+    /// space. If a CustomArgAllocatorCB callback is specified, it will be used
+    /// to allocate any structures or variable copies needed to pass arguments
+    /// to the outlined function, rather than using regular allocas.
     LLVM_ABI
     CodeExtractor(ArrayRef<BasicBlock *> BBs, DominatorTree *DT = nullptr,
                   bool AggregateArgs = false, BlockFrequencyInfo *BFI = nullptr,
@@ -157,7 +182,10 @@ class CodeExtractorAnalysisCache {
                   AssumptionCache *AC = nullptr, bool AllowVarArgs = false,
                   bool AllowAlloca = false,
                   BasicBlock *AllocationBlock = nullptr,
-                  std::string Suffix = "", bool ArgsInZeroAddressSpace = false);
+                  std::string Suffix = "", bool ArgsInZeroAddressSpace = false,
+                  CustomArgAllocatorCBTy *CustomArgAllocatorCB = nullptr,
+                  BasicBlock *DeallocationBlock = nullptr,
+                  CustomArgDeallocatorCBTy *CustomArgDeallocatorCB = nullptr);
 
     /// Perform the extraction, returning the new function.
     ///
@@ -177,8 +205,9 @@ class CodeExtractorAnalysisCache {
     /// newly outlined function.
     /// \returns zero when called on a CodeExtractor instance where isEligible
     /// returns false.
-    LLVM_ABI Function *extractCodeRegion(const CodeExtractorAnalysisCache &CEAC,
-                                         ValueSet &Inputs, ValueSet &Outputs);
+    LLVM_ABI Function *
+    extractCodeRegion(const CodeExtractorAnalysisCache &CEAC, ValueSet &Inputs,
+                      ValueSet &Outputs);
 
     /// Verify that assumption cache isn't stale after a region is extracted.
     /// Returns true when verifier finds errors. AssumptionCache is passed as
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index 2e8fb5efb7743..a913958c0de9a 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -268,6 +268,38 @@ computeOpenMPScheduleType(ScheduleKind ClauseKind, bool HasChunks,
   return Result;
 }
 
+/// Given a function, if it represents the entry point of a target kernel, this
+/// returns the execution mode flags associated to that kernel.
+static std::optional<omp::OMPTgtExecModeFlags>
+getTargetKernelExecMode(Function &Kernel) {
+  CallInst *TargetInitCall = nullptr;
+  for (Instruction &Inst : Kernel.getEntryBlock()) {
+    if (auto *Call = dyn_cast<CallInst>(&Inst)) {
+      if (Call->getCalledFunction()->getName() == "__kmpc_target_init") {
+        TargetInitCall = Call;
+        break;
+      }
+    }
+  }
+
+  if (!TargetInitCall)
+    return std::nullopt;
+
+  // Get the kernel mode information from the global variable associated to the
+  // first argument to the call to __kmpc_target_init. Refer to
+  // createTargetInit() to see how this is initialized.
+  Value *InitOperand = TargetInitCall->getArgOperand(0);
+  GlobalVariable *KernelEnv = nullptr;
+  if (auto *Cast = dyn_cast<ConstantExpr>(InitOperand))
+    KernelEnv = cast<GlobalVariable>(Cast->getOperand(0));
+  else
+    KernelEnv = cast<GlobalVariable>(InitOperand);
+  auto *KernelEnvInit = cast<ConstantStruct>(KernelEnv->getInitializer());
+  auto *ConfigEnv = cast<ConstantStruct>(KernelEnvInit->getOperand(0));
+  auto *KernelMode = cast<ConstantInt>(ConfigEnv->getOperand(2));
+  return static_cast<OMPTgtExecModeFlags>(KernelMode->getZExtValue());
+}
+
 /// Make \p Source branch to \p Target.
 ///
 /// Handles two situations:
@@ -702,15 +734,19 @@ void OpenMPIRBuilder::finalize(Function *Fn) {
     // CodeExtractor generates correct code for extracted functions
     // which are used by OpenMP runtime.
     bool ArgsInZeroAddressSpace = Config.isTargetDevice();
-    CodeExtractor Extractor(Blocks, /* DominatorTree */ nullptr,
-                            /* AggregateArgs */ true,
-                            /* BlockFrequencyInfo */ nullptr,
-                            /* BranchProbabilityInfo */ nullptr,
-                            /* AssumptionCache */ nullptr,
-                            /* AllowVarArgs */ true,
-                            /* AllowAlloca */ true,
-                            /* AllocaBlock*/ OI.OuterAllocaBB,
-                            /* Suffix */ ".omp_par", ArgsInZeroAddressSpace);
+    CodeExtractor Extractor(
+        Blocks, /* DominatorTree */ nullptr,
+        /* AggregateArgs */ true,
+        /* BlockFrequencyInfo */ nullptr,
+        /* BranchProbabilityInfo */ nullptr,
+        /* AssumptionCache */ nullptr,
+        /* AllowVarArgs */ true,
+        /* AllowAlloca */ true,
+        /* AllocaBlock*/ OI.OuterAllocaBB,
+        /* Suffix */ ".omp_par", ArgsInZeroAddressSpace,
+        OI.CustomArgAllocatorCB ? &OI.CustomArgAllocatorCB : nullptr,
+        /* DeallocationBlock */ OI.ExitBB,
+        OI.CustomArgDeallocatorCB ? &OI.CustomArgDeallocatorCB : nullptr);
 
     LLVM_DEBUG(dbgs() << "Before     outlining: " << *OuterFn << "\n");
     LLVM_DEBUG(dbgs() << "Entry " << OI.EntryBB->getName()
@@ -1614,6 +1650,50 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::createParallel(
                              IfCondition, NumThreads, PrivTID, PrivTIDAddr,
                              ThreadID, ToBeDeletedVec);
     };
+
+    std::optional<omp::OMPTgtExecModeFlags> ExecMode =
+        getTargetKernelExecMode(*OuterFn);
+
+    // If OuterFn is not a Generic kernel, skip custom allocation. This causes
+    // the CodeExtractor to follow its default behavior. Otherwise, we need to
+    // use device shared memory to allocate argument structures.
+    if (ExecMode && *ExecMode & OMP_TGT_EXEC_MODE_GENERIC) {
+      OI.CustomArgAllocatorCB = [this,
+                                 EntryBB](BasicBlock *, BasicBlock::iterator,
+                                          Type *ArgTy, const Twine &Name) {
+        // Instead of using the insertion point provided by the CodeExtractor,
+        // here we need to use the block that eventually calls the outlined
+        // function for the `parallel` construct.
+        //
+        // The reason is that the explicit deallocation call will be inserted
+        // within the outlined function, whereas the alloca insertion point
+        // might actually be located somewhere else in the caller. This becomes
+        // a problem when e.g. `parallel` is inside of a `distribute` construct,
+        // because the deallocation would be executed multiple times and the
+        // allocation just once (outside of the loop).
+        //
+        // TODO: Ideally, we'd want to do the allocation and deallocation
+        // outside of the `parallel` outlined function, hence using here the
+        // insertion point provided by the CodeExtractor. We can't do this at
+        // the moment because there is currently no way of passing an eligible
+        // insertion point for the explicit deallocation to the CodeExtractor,
+        // as that block is created (at least when nested inside of
+        // `distribute`) sometime after createParallel() completed, so it can't
+        // be stored in the OutlineInfo structure here.
+        //
+        // The current approach results in an explicit allocation and
+        // deallocation pair for each `distribute` loop iteration in that case,
+        // which is suboptimal.
+        return createOMPAllocShared(
+            InsertPointTy(EntryBB, EntryBB->getFirstInsertionPt()), ArgTy,
+            Name);
+      };
+      OI.CustomArgDeallocatorCB =
+          [this](BasicBlock *BB, BasicBlock::iterator AllocIP, Value *Arg,
+                 Type *ArgTy) -> Instruction * {
+        return createOMPFreeShared(InsertPointTy(BB, AllocIP), Arg, ArgTy);
+      };
+    }
   } else {
     // Generate OpenMP host runtime call
     OI.PostOutlineCB = [=, ToBeDeletedVec =
diff --git a/llvm/lib/Transforms/Utils/CodeExtractor.cpp b/llvm/lib/Transforms/Utils/CodeExtractor.cpp
index 7a9dd37b72205..a4943150fdffc 100644
--- a/llvm/lib/Transforms/Utils/CodeExtractor.cpp
+++ b/llvm/lib/Transforms/Utils/CodeExtractor.cpp
@@ -25,7 +25,6 @@
 #include "llvm/Analysis/BranchProbabilityInfo.h"
 #include "llvm/IR/Argument.h"
 #include "llvm/IR/Attributes.h"
-#include "llvm/IR/BasicBlock.h"
 #include "llvm/IR/CFG.h"
 #include "llvm/IR/Constant.h"
 #include "llvm/IR/Constants.h"
@@ -265,12 +264,18 @@ CodeExtractor::CodeExtractor(ArrayRef<BasicBlock *> BBs, DominatorTree *DT,
                              BranchProbabilityInfo *BPI, AssumptionCache *AC,
                              bool AllowVarArgs, bool AllowAlloca,
                              BasicBlock *AllocationBlock, std::string Suffix,
-                             bool ArgsInZeroAddressSpace)
+                             bool ArgsInZeroAddressSpace,
+                             CustomArgAllocatorCBTy *CustomArgAllocatorCB,
+                             BasicBlock *DeallocationBlock,
+                             CustomArgDeallocatorCBTy *CustomArgDeallocatorCB)
     : DT(DT), AggregateArgs(AggregateArgs || AggregateArgsOpt), BFI(BFI),
       BPI(BPI), AC(AC), AllocationBlock(AllocationBlock),
       AllowVarArgs(AllowVarArgs),
       Blocks(buildExtractionBlockSet(BBs, DT, AllowVarArgs, AllowAlloca)),
-      Suffix(Suffix), ArgsInZeroAddressSpace(ArgsInZeroAddressSpace) {}
+      Suffix(Suffix), ArgsInZeroAddressSpace(ArgsInZeroAddressSpace),
+      CustomArgAllocatorCB(CustomArgAllocatorCB),
+      DeallocationBlock(DeallocationBlock),
+      CustomArgDeallocatorCB(CustomArgDeallocatorCB) {}
 
 /// definedInRegion - Return true if the specified value is defined in the
 /// extracted region.
@@ -1852,24 +1857,38 @@ CallInst *CodeExtractor::emitReplacerCall(
     if (StructValues.contains(output))
       continue;
 
-    AllocaInst *alloca = new AllocaInst(
-        output->getType(), DL.getAllocaAddrSpace(), nullptr,
-        output->getName() + ".loc", AllocaBlock->getFirstInsertionPt());
-    params.push_back(alloca);
-    ReloadOutputs.push_back(alloca);
+    Value *OutAlloc;
+    if (CustomArgAllocatorCB)
+      OutAlloc = (*CustomArgAllocatorCB)(
+          AllocaBlock, AllocaBlock->getFirstInsertionPt(), output->getType(),
+          output->getName() + ".loc");
+    else
+      OutAlloc = new AllocaInst(output->getType(), DL.getAllocaAddrSpace(),
+                                nullptr, output->getName() + ".loc",
+                                AllocaBlock->getFirstInsertionPt());
+
+    params.push_back(OutAlloc);
+    ReloadOutputs.push_back(OutAlloc);
   }
 
-  AllocaInst *Struct = nullptr;
+  Instruction *Struct = nullptr;
   if (!StructValues.empty()) {
-    Struct = new AllocaInst(StructArgTy, DL.getAllocaAddrSpace(), nullptr,
-                            "structArg", AllocaBlock->getFirstInsertionPt());
-    if (ArgsInZeroAddressSpace && DL.getAllocaAddrSpace() != 0) {
-      auto *StructSpaceCast = new AddrSpaceCastInst(
-          Struct, PointerType ::get(Context, 0), "structArg.ascast");
-      StructSpaceCast->insertAfter(Struct->getIterator());
-      params.push_back(StructSpaceCast);
-    } else {
+    BasicBlock::iterator StructArgIP = AllocaBlock->getFirstInsertionPt();
+    if (CustomArgAllocatorCB) {
+      Struct = (*CustomArgAllocatorCB)(AllocaBlock, StructArgIP, StructArgTy,
+                                       "structArg");
       params.push_back(Struct);
+    } else {
+      Struct = new AllocaInst(StructArgTy, DL.getAllocaAddrSpace(), nullptr,
+                              "structArg", StructArgIP);
+      if (ArgsInZeroAddressSpace && DL.getAllocaAddrSpace() != 0) {
+        auto *StructSpaceCast = new AddrSpaceCastInst(
+            Struct, PointerType ::get(Context, 0), "structArg.ascast");
+        StructSpaceCast->insertAfter(Struct->getIterator());
+        params.push_back(StructSpaceCast);
+      } else {
+        params.push_back(Struct);
+      }
     }
 
     unsigned AggIdx = 0;
@@ -2013,6 +2032,26 @@ CallInst *CodeExtractor::emitReplacerCall(
   insertLifetimeMarkersSurroundingCall(oldFunction->getParent(), LifetimesStart,
                                        {}, call);
 
+  // Deallocate variables that used a custom allocator.
+  if (CustomArgAllocatorCB && CustomArgDeallocatorCB) {
+    BasicBlock *DeallocBlock = codeReplacer;
+    BasicBlock::iterator DeallocIP = codeReplacer->end();
+    if (DeallocationBlock) {
+      DeallocBlock = DeallocationBlock;
+      DeallocIP = DeallocationBlock->getFirstInsertionPt();
+    }
+
+    int Index = 0;
+    for (Value *Output : outputs) {
+      if (!StructValues.contains(Output))
+        (*CustomArgDeallocatorCB)(DeallocBlock, DeallocIP,
+                                  ReloadOutputs[Index++], Output->getType());
+    }
+
+    if (Struct)
+      (*CustomArgDeallocatorCB)(DeallocBlock, DeallocIP, Struct, StructArgTy);
+  }
+
   return call;
 }
 
diff --git a/mlir/test/Target/LLVMIR/omptarget-parallel-llvm.mlir b/mlir/test/Target/LLVMIR/omptarget-parallel-llvm.mlir
index 60c6fa4dd8f1e..504e39c96f008 100644
--- a/mlir/test/Target/LLVMIR/omptarget-parallel-llvm.mlir
+++ b/mlir/test/Target/LLVMIR/omptarget-parallel-llvm.mlir
@@ -56,8 +56,6 @@ module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<"dlti.alloca_memo
 // CHECK-SAME: ptr %[[TMP:.*]], ptr %[[TMP0:.*]]) #{{[0-9]+}} {
 // CHECK:         %[[TMP1:.*]] = alloca [1 x ptr], align 8, addrspace(5)
 // CHECK:         %[[TMP2:.*]] = addrspacecast ptr addrspace(5) %[[TMP1]] to ptr
-// CHECK:         %[[STRUCTARG:.*]] = alloca { ptr }, align 8, addrspace(5)
-// CHECK:         %[[STRUCTARG_ASCAST:.*]] = addrspacecast ptr addrspace(5) %[[STRUCTARG]] to ptr
 // CHECK:         %[[TMP3:.*]] = alloca ptr, align 8, addrspace(5)
 // CHECK:         %[[TMP4:.*]] = addrspacecast ptr addrspace(5) %[[TMP3]] to ptr
 // CHECK:         store ptr %[[TMP0]], ptr %[[TMP4]], align 8
@@ -65,12 +63,14 @@ module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<"dlti.alloca_memo
 // CHECK:         %[[EXEC_USER_CODE:.*]] = icmp eq i32 %[[TMP5]], -1
 // CHECK:         br i1 %[[EXEC_USER_CODE]], label %[[USER_CODE_ENTRY:.*]], label %[[WORKER_EXIT:.*]]
 // CHECK:         %[[TMP6:.*]] = load ptr, ptr %[[TMP4]], align 8
+// CHECK:         %[[STRUCTARG:.*]] = call align 8 ptr @__kmpc_alloc_shared(i64 8)
 // CHECK:         %[[OMP_GLOBAL_THREAD_NUM:.*]] = call i32 @__kmpc_global_thread_num(ptr addrspacecast (ptr addrspace(1) @[[GLOB1:[0-9]+]] to ptr))
-// CHECK:         %[[GEP_:.*]] = getelementptr { ptr }, ptr addrspace(5) %[[STRUCTARG]], i32 0, i32 0
-// CHECK:         store ptr %[[TMP6]], ptr addrspace(5) %[[GEP_]], align 8
+// CHECK:         %[[GEP_:.*]] = getelementptr { ptr }, ptr %[[STRUCTARG]], i32 0, i32 0
+// CHECK:         store ptr %[[TMP6]], ptr %[[GEP_]], align 8
 // CHECK:         %[[TMP7:.*]] = getelementptr inbounds [1 x ptr], ptr %[[TMP2]], i64 0, i64 0
-// CHECK:         store ptr %[[STRUCTARG_ASCAST]], ptr %[[TMP7]], align 8
+// CHECK:         store ptr %[[STRUCTARG]], ptr %[[TMP7]], align 8
 // CHECK:         call void @__kmpc_parallel_51(ptr addrspacecast (ptr addrspace(1) @[[GLOB1]] to ptr), i32 %[[OMP_GLOBAL_THREAD_NUM]], i32 1, i32 -1, i32 -1, ptr @[[FUNC1:.*]], ptr null, ptr %[[TMP2]], i64 1)
+// CHECK:         call void @__kmpc_free_shared(ptr %[[STRUCTARG]], i64 8)
 // CHECK:         call void @__kmpc_target_deinit()
 
 // CHECK: define internal void @[[FUNC1]](

skatrak · 2025-07-28T11:15:23Z

PR stack:

Meinersbur

Passing callbacks around significantly adds complexity. Did you consider deriving a new class from CodeExtractor and overriding some method createAlloca for use by OpenMPIRBuilder?

llvm/include/llvm/Transforms/Utils/CodeExtractor.h

Meinersbur · 2025-07-31T12:45:14Z

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp

+        // TODO: Ideally, we'd want to do the allocation and deallocation
+        // outside of the `parallel` outlined function, hence using here the
+        // insertion point provided by the CodeExtractor. We can't do this at
+        // the moment because there is currently no way of passing an eligible
+        // insertion point for the explicit deallocation to the CodeExtractor,
+        // as that block is created (at least when nested inside of
+        // `distribute`) sometime after createParallel() completed, so it can't
+        // be stored in the OutlineInfo structure here.


Could a temporary block be created that is then connected to the CFG later?

Did you have any place in mind as to where a pointer to that block could be stored? The problem I see is that we'd have to link it to the exit block that's paired to the alloca block, which may have been created by the translation of any operation somewhere N levels up the call stack.

I spent about a week trying to get a proper deallocation block to be paired to the OI.OuterAllocaBB, and here's a summary of what I found:

the insertion point used to figure out what the allocation block should be is passed to the createParallel call;

this is obtained in convertOmpParallel, during MLIR to LLVM IR translation, by looking at the nearest OpenMPAllocaStackFrame or by using the function's entry block if no eligible stack frames are found;

OpenMPAllocaStackFrames are created right before translating the body of some operations, which is e.g. the case of omp.distribute, and they will point to a block that has been expressly created for child ops to perform allocations;

other entry/exit blocks might be introduced in between, since not all operations push new OpenMPAllocaStackFrames, so by the time we get to OpenMPIRBuilder::createParallel we have no way of knowing what block is supposed to be the exit for the given outer alloca block; and

a pointer to the eligible deallocation block associated to the alloca block is only currently available at the point where both are created (see e.g. OpenMPIRBuilder::createDistribute).

So, after this previous investigation and looking at it again with fresh eyes, I think that one way we could make this work would be to:

Extend OpenMPAllocaStackFrame to hold alloc and dealloc insertion points.

Add to OpenMPIRBuilder::BodyGenCallbackTy (and potentially to other similar callback types) an additional insert point argument for deallocations.

Update body callbacks creating an OpenMPAllocaStackFrame in MLIR to LLVM IR to store the new deallocation insertion point argument into the stack frame.

Modify OMPIRBuilder translation functions taking the aforementioned callback(s) to pass their exit blocks as deallocation points when generating their body.

Add a new OuterDeallocBB to OpenMPIRBuilder::OutlineInfo and pass it as the deallocation block to the CodeExtractor constructor in OpenMPIRBuilder::finalize.

Update OpenMPIRBuilder::createParallel to take the new deallocIP returned by findAllocaInsertPoint and store it as the OI.OuterDeallocBB.

I think that should work for Flang and let us use the CodeExtractor-provided insert point for both allocations and deallocations, while also doing this in the right spot (i.e. not inside of a loop whenever possible). However, there's also Clang uses of the OMPIRBuilder to contend with, and a quick look tells me we won't have easy access to an exit block as we do for Flang, though I'm not too familiar with it. As it stands, Clang won't run this one code path where we actually use that deallocation insertion point, so passing null around and having a sane default behavior will work, but eventually we'd have to deal with that issue.

Do you have an opinion on this? I can work on those changes as another patch in the stack to hopefully make things a bit better or I could make the changes here, though I think such fundamental changes are better done as an independent PR. Or maybe there's a better way I haven't thought about, let me know.

This was meant as on open question since I do not fully uderstand the problem. The idea with the temporary block was to create an unconnected BB and then later connect/move it to the expected location e.g. in the finalize() method or by the caller of createParallel, though I do not know how they would know where to insert it. That temporary BB could also be created by the caller, have pass it to createParallel (e.g. as deallocIP), then make it the caller's responsibility to connect it. It sounds like you have about the same in mind.

OK to defer it to some later point.

Yeah, what I'm describing is something close to that, but we can't just create a block in the caller to createParallel because the caller would still not know where to link it, since the alloca insertion point could be associated to some parent operation multiple layers up. I'll try to implement that proposal as an independent patch and see if I can make it work.

bhandarkar-pranav

Overall, the approach makes sense to me. thank you for the PR. Only big concern is @Meinersbur's comment on using a derived class of CodeExtractor instead of callbacks. Having said that callbacks are all over the place in OMPIRBuilder.

llvm/include/llvm/Transforms/Utils/CodeExtractor.h

llvm/lib/Transforms/Utils/CodeExtractor.cpp

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp

Meinersbur · 2025-08-07T11:11:06Z

Having said that callbacks are all over the place in OMPIRBuilder.

There is a term for it: Callback hell

Argument structures are created when sections of the LLVM IR corresponding to an OpenMP construct are outlined into their own function. For this, stack allocations are used. This patch modifies this behavior when compiling for a target device and outlining `parallel`-related IR, so that it uses device shared memory instead of private stack space. This is needed in order for threads to have access to these arguments.

…e` method due to an invalid builder insertion point

skatrak · 2025-08-14T11:37:02Z

Passing callbacks around significantly adds complexity. Did you consider deriving a new class from CodeExtractor and overriding some method createAlloca for use by OpenMPIRBuilder?

Overall, the approach makes sense to me. thank you for the PR. Only big concern is @Meinersbur's comment on using a derived class of CodeExtractor instead of callbacks. Having said that callbacks are all over the place in OMPIRBuilder.

Thank you @Meinersbur and @bhandarkar-pranav for the reviews, I'll try to explain my though process on this so we can decide on a solution for this problem.

The problem is that the logic inside of the allocator/deallocator will not necessarily be the same for all custom instances. We could create an OMPIRBuilder-local subclass of CodeExtractor holding a reference to the OMPIRBuilder itself, and then subclass it for each case by overriding allocation and deallocation methods. The issue then is that we'd have to know at the point of creating the CodeExtractor from an OutlineInfo struct, in OMPIRBuilder::finalize, which subclass (or base class) we would have to instantiate. So, we'd have to extend OutlineInfo to hold the union of all unique information needed to initialize each OutlineInfo subclass constructor and some enumeration or something else that can be used to select which one to create.

This patch only introduces a custom allocator for parallel when compiling for a target device, which specifically uses __kmpc_{alloc,free}_shared, but in my mind whatever we do needs to allow us to support other custom cases if they arise (e.g. it calls other function(s) with different arguments, it applies to different constructs, etc). It seems to me that callbacks are the simpler solution in this case, though I get that OMPIRBuilder has some next-level spaghetti execution going on as it is and I'm suggesting adding yet another use for callbacks in it.

Let me know if you still think subclassing CodeExtractor is a better way forward for this and I'll give it a try.

Meinersbur · 2025-08-14T13:44:46Z

How many cases/subclasses do you think will eventually be needed? From what I see it might be either shared memory or standard alloca. That is one additional member for OutlineInfo. This PR currently adds already two. I think only one subclass of CodeExtractor is necessary; it can receive a pointer to OutlineInfo and decide in its methods what kind of alloca it wants to create (Although now that you mention it, separate classes might be nicer).

A lambda is already syntactic sugar for a new (anonymopus) class. The additional data captured by it ist still there, but as members of that anonymous class. One may prefer that syntactic sugar over writing classes manually, but becomes problematic if it stops representing the logical structure of the program.

I think the biggest problem with callbacks that are not limited to a local scope is that it is hard to know what is being called. How do you find all possible implementations of it if you need to fix change the semantics? What does the callback have to do or is allowed to do? A typical callback is used to inform another component (e.g. a logger) that something it happened, and the callback will only modify state in the other component that the callback caller is unrelated to. Anything else easily leads to a "devil's contract" where lambdas just do anything because it worked at some point, but the caller was not designed to allow this. Control flow is unpredictable turning a program into spaghetti code. Some say "Callbacks are the modern goto".

In this case there is an obvious structure that point to a functional subclass:

Allocator and deallocator must be paired, you cannot combine them arbitrarily -> same class
There is a default implementation¹ -> virtual method with default implementation in the base class
The callbacks are expected to do something, and the CodeExtractor stops working if they don't -> not a typical callback
The callbacks are expected to do nothing else
OutlineInfo already contains the information the CodeExtractor is supposed to need²
Using shared memory is a reusable feature, other constructs than parallel may want to use it without copy&pasting parallel's lamda.
Using shared memory is a feature that should be discoverable by programmers without looking into parallel's internal working.

I think understandable and structured code is much more important than saving a few lines because lambdas are convenient. Unless we have a depenency inversion problem to solve, additional members to OutlineInfo and allocation implementations can always be added.

Rather checking whether the callback is nullptr, the ctor could assign a default callback function that contains the default alloca creation ↩
That might indicate that OutlineInfo itself could be that derived class, but I wouldn't go that far. Composition is usually preferred over inheritance. ↩

skatrak · 2025-08-14T15:12:26Z

Thank you Michael for elaborating on this. Those are all fair points. I'm not really sure of how many other custom allocators we may end up needing, it might just end up being regular allocas vs device shared memory in the end. It's probably more likely we'd want to reuse the device shared memory allocator than creating a bunch of different custom allocators. So I think I'll work on refactoring this and see how well I can integrate it with everything else.

One problem with the current implementation is that the custom allocator is capturing EntryBB to basically override the alloca insertion point, a limitation related to the other concurrent discussion in this PR. So, in order to not make this dependent on that other potentially big set of changes, I'll have to store it in the OutlineInfo structure temporarily at least.

…eation based on OutlineInfo structures

skatrak · 2025-08-15T15:17:01Z

I just pushed a commit replacing callbacks with CodeExtractor subclasses. I also moved some of the logic to create a CodeExtractor from an OutlineInfo so that hopefully using these custom CodeExtractors doesn't turn out too much of a hassle. Let me know if this works for you @Meinersbur, @bhandarkar-pranav.

Next week I'll try to work on obtaining and using a proper deallocation block, so that we don't have to override it and get everything in better shape for merging. That'll be an additional PR added to the stack.

bhandarkar-pranav

Thank you for reworking the PR, @skatrak. LGTM.

Meinersbur

Nice

Meinersbur · 2025-08-19T10:12:13Z

llvm/lib/Transforms/Utils/CodeExtractor.cpp

+
+Instruction *CodeExtractor::deallocateVar(BasicBlock *, BasicBlock::iterator,
+                                          Value *, Type *) {
+  return nullptr;


Consider a comment such as

// Default alloca instruction created by allocateVar is released implicitly at function exit

llvm/lib/Transforms/Utils/CodeExtractor.cpp

skatrak requested review from jmorse, Meinersbur, tblah, agozillon, DominikAdamski and TIFitis July 28, 2025 11:12

llvmbot added mlir:llvm mlir flang:openmp llvm:transforms clang:openmp OpenMP related changes to Clang labels Jul 28, 2025

This was referenced Jul 28, 2025

[MLIR][OpenMP] Remove Generic-SPMD early detection #150922

Open

[OpenMP][OMPIRBuilder] Add device shared memory allocation support #150923

Open

This was referenced Jul 28, 2025

[MLIR][OpenMP] Support allocations of device shared memory #150924

Open

[OpenMP][OMPIRBuilder] Support parallel in Generic kernels #150926

Open

[OpenMPOpt] Make parallel regions reachable from new DeviceRTL loop functions #150927

Open

skatrak force-pushed the users/skatrak/flang-generic-04-parallel-args branch from 6a97ff2 to 8b34402 Compare July 28, 2025 11:22

Meinersbur reviewed Jul 31, 2025

View reviewed changes

llvmbot added the mlir:openmp label Aug 6, 2025

bhandarkar-pranav reviewed Aug 6, 2025

View reviewed changes

llvm/include/llvm/Transforms/Utils/CodeExtractor.h Outdated Show resolved Hide resolved

llvm/lib/Transforms/Utils/CodeExtractor.cpp Outdated Show resolved Hide resolved

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp Outdated Show resolved Hide resolved

skatrak force-pushed the users/skatrak/flang-generic-03-mlir-shared-mem branch from 0586e88 to 1b7dd6c Compare August 13, 2025 13:17

skatrak added 2 commits August 13, 2025 14:36

Address intermittent ICE triggered from the `OpenMPIRBuilder::finaliz…

4da11cf

…e` method due to an invalid builder insertion point

Address nits

06a2570

skatrak force-pushed the users/skatrak/flang-generic-04-parallel-args branch from 0237b82 to 06a2570 Compare August 14, 2025 11:40

Replace CodeExtractor callbacks with subclasses and simplify their cr…

be95567

…eation based on OutlineInfo structures

bhandarkar-pranav approved these changes Aug 17, 2025

View reviewed changes

Meinersbur approved these changes Aug 19, 2025

View reviewed changes

skatrak mentioned this pull request Aug 21, 2025

[OMPIRBuilder] Add support for explicit deallocation points #154752

Open

[OpenMP][OMPIRBuilder] Use device shared memory for arg structures #150925

Are you sure you want to change the base?

[OpenMP][OMPIRBuilder] Use device shared memory for arg structures #150925

Uh oh!

Conversation

skatrak commented Jul 28, 2025

Uh oh!

llvmbot commented Jul 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

skatrak commented Jul 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Meinersbur left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Meinersbur Jul 31, 2025

Choose a reason for hiding this comment

Uh oh!

skatrak Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

Meinersbur Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

skatrak Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

bhandarkar-pranav left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Meinersbur commented Aug 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

skatrak commented Aug 14, 2025

Uh oh!

Meinersbur commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Footnotes

Uh oh!

skatrak commented Aug 14, 2025

Uh oh!

skatrak commented Aug 15, 2025

Uh oh!

bhandarkar-pranav left a comment

Choose a reason for hiding this comment

Uh oh!

Meinersbur left a comment

Choose a reason for hiding this comment

Uh oh!

Meinersbur Aug 19, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

llvmbot commented Jul 28, 2025 •

edited

Loading

skatrak commented Jul 28, 2025 •

edited

Loading

Meinersbur commented Aug 7, 2025 •

edited

Loading

Meinersbur commented Aug 14, 2025 •

edited

Loading